WIP: Preventing the loss from being computed when the input token is EOS Token #878
Pulling the latest commits from main fork
Pulling from the main repo
Pulling from mosaicml/llm-foundry main
Merging from mosaic main
Pulling from mosaic main
Pulling from mosaic main.
I think having this option is good; some users will almost certainly want it. However, it should be optional, because I am not convinced the model shouldn't learn to predict the token after EOS. If sequences are joined randomly, I'd expect the model to learn that after EOS it can disregard all context and pick from the distribution of tokens that begin sequences. That distribution differs from the raw unigram frequencies, which are the probabilities it should use when picking a token not conditioned on EOS. And if sequences are not joined randomly, as in that TSP NN method, we definitely want to compute the loss.
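To make the distinction in the comment above concrete, here is a minimal sketch on a toy corpus. The documents and token strings are made up for illustration; it only shows that the distribution of sequence-initial tokens can differ from the raw unigram distribution.

```python
from collections import Counter

# Hypothetical toy corpus: each inner list is one tokenized document.
docs = [
    ["The", "cat", "sat"],
    ["The", "dog", "ran"],
    ["A", "cat", "ran"],
]


def normalize(counts):
    """Turn a Counter of token counts into a probability dict."""
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


# Distribution over tokens that begin a document. With randomly ordered,
# concatenated documents, this is what a model could learn to predict
# immediately after EOS.
start_dist = normalize(Counter(doc[0] for doc in docs))

# Raw unigram distribution over every token in the corpus.
unigram_dist = normalize(Counter(tok for doc in docs for tok in doc))

print(start_dist)
print(unigram_dist)
```

In this toy data "The" starts two of the three documents but accounts for only two of the nine tokens overall, so the two distributions disagree.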
Thanks for your comment! Yes, what you said makes sense. This is still very much a work in progress, and I just wanted to run some initial experiments as a sanity check.
@samhavens should we also add the option to not predict BOS (assuming the previous token is the end of the previous sequence)?
@vchiley for models which have both EOS and BOS, are you saying the model shouldn't learn that BOS comes after EOS? It isn't worth learning, true, but we'll always stop generating at EOS, so it wouldn't matter. Or am I misunderstanding?
as discussed on Slack, I think that:
The model should not be trained to predict the token after the eos_token, because that token comes from a different sequence. This PR implements this logic.
TODO: Experimental verification.
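A minimal sketch of the masking described above, not the PR's actual code. It assumes next-token labels aligned position by position with the inputs, and uses -100 as the ignore value because that is the default `ignore_index` of PyTorch's cross-entropy loss. The helper `build_labels` and the flag `mask_after_eos` are hypothetical names; the flag reflects the request in the conversation that the behavior be optional.

```python
# Default ignore_index of PyTorch's cross-entropy loss; positions carrying
# this label contribute nothing to the loss.
IGNORE_INDEX = -100


def build_labels(input_ids, eos_token_id, mask_after_eos=True):
    """Build next-token labels for one packed sequence (hypothetical helper).

    labels[i] is the target for the prediction made at position i,
    i.e. input_ids[i + 1]. The final position has no target.
    """
    labels = []
    for i, tok in enumerate(input_ids):
        if i + 1 >= len(input_ids):
            # Nothing follows the last token, so there is no target.
            labels.append(IGNORE_INDEX)
        elif mask_after_eos and tok == eos_token_id:
            # The input token is EOS: the next token belongs to a
            # different sequence, so skip the loss at this position.
            labels.append(IGNORE_INDEX)
        else:
            labels.append(input_ids[i + 1])
    return labels


EOS = 0  # assumed EOS id for this example only
packed = [5, 6, EOS, 7, 8, EOS]  # two sequences packed together

print(build_labels(packed, EOS))                        # [6, 0, -100, 8, 0, -100]
print(build_labels(packed, EOS, mask_after_eos=False))  # [6, 0, 7, 8, 0, -100]
```

Note that the model is still trained to predict EOS itself (the labels at positions 1 and 4); only the prediction made from an EOS input is masked.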